ABSTRACT



A method and apparatus for detecting and tolerating situa- 
tions in which one or more processors in a multi-processor 
system cannot participate in timer-driven or timer-triggered 
protocols or event sequences. The multi-processor system 
includes multiple processors each having a respective 
memory. These processors are coupled by an inter-processor 
communication network (preferably consisting of redundant 
paths). 

Processors are suspected of having failed (ceased opera- 
tions) outright or having a failed timer mechanism when 
other processors detect the absence of periodic "IamAlive" 
messages from other processors. When this happens, all of 
the processors in the system are subjected to a series of 
stages in which they repeatedly broadcast their status and 
their connectivity to each other. During the first such stage, 
according to the present invention, a processor will not 
assert its ability to participate unless its timer mechanism is 
working. It arms a timer expiration event and does not assert 
its health until and unless that timer expiration event occurs. 
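
The timer-gated assertion of health described above can be pictured with a minimal Python sketch. It is illustrative only: the class name, method names and the `known_mask` field are hypothetical and do not come from the specification, and a real processor would receive the expiration event from its timer hardware rather than from a direct method call.

```python
class Stage1Participant:
    """Hypothetical model of one processor in the first regroup stage."""

    def __init__(self, processor_number: int) -> None:
        self.processor_number = processor_number
        self.timer_armed = False
        # Bit i set means processor i has asserted its health (assumed layout).
        self.known_mask = 0

    def enter_first_stage(self) -> None:
        # Arm the timer, but do not yet assert our own health.
        self.known_mask = 0
        self.timer_armed = True

    def on_timer_expiration(self) -> None:
        # Reached only if the timer mechanism actually delivers the event.
        if self.timer_armed:
            self.timer_armed = False
            self.known_mask |= 1 << self.processor_number

    def asserts_health(self) -> bool:
        return bool(self.known_mask & (1 << self.processor_number))
```

In this sketch a processor whose timer never fires never sets its own bit, so it never claims that it can participate.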
METHOD AND APPARATUS FOR
TOLERANCE OF LOST TIMER TICKS
DURING RECOVERY OF A MULTI-
PROCESSOR SYSTEM

This invention relates generally to fault-tolerant multi-processor
systems. In particular, this invention relates to methods for
improving the resilience of a multiprocessor system in the face of
the failure of periodic or timed activities on a constituent
processor.

RELATED PATENT APPLICATIONS

U.S. patent application Ser. No. 08/265,585 entitled, "Method and
Apparatus for Fault-Tolerant Multi-processing System Recovery from
Power Failure or Drop-Outs," filed Jun. 23, 1994, naming as inventors
Robert L. Jardine, Richard M. Collins and Larry D. Reeves, under an
obligation of assignment to the assignee of this invention.

U.S. patent application Ser. No. 08/487,941 entitled, "A Method to
Improve Tolerance of Non-Homogeneous Power Outages," filed Jun. 7,
1995, naming as inventors Robert L. Jardine, Richard M. Collins and
A. Richard Zacher, under an obligation of assignment to the assignee
of this invention.

U.S. patent application Ser. No. 08/[illegible] entitled, "Method and
Apparatus for Split-Brain Avoidance in a Multi-Processor System,"
filed on the same date as the instant application, naming as
inventors Robert L. Jardine, Murali Basavaiah and Karoor S.
Krishnakumar, under an obligation of assignment to the assignee of
this invention.

U.S. patent application Ser. No. 08/790,030 entitled, "Method and
Apparatus for Node Pruning a Multi-Processor System for Maximal,
Full Connection During Recovery," filed on the same date as the
instant application, naming as inventors Murali Basavaiah and Karoor
S. Krishnakumar, under an obligation of assignment to the assignee of
this invention.

U.S. patent application Ser. No. 08/789,257 entitled, "Method and
Apparatus for Distributed Agreement on Processor Membership in a
Multi-Processor System During Recovery," filed on the same date as
the instant application, naming as inventors Robert L. Jardine,
Murali Basavaiah, Karoor S. Krishnakumar, and Srinivasa D. Murthy,
under an obligation of assignment to the assignee of this invention.

BACKGROUND OF THE INVENTION

Distributed, shared-nothing multi-processor architectures and
fault-tolerant software using process pairs require that all
processors in a system have a consistent image of the processors
making up the system. (The NonStop Kernel® available from the
assignee of this application is an example of such fault-tolerant
software.) This consistent system image is crucial for maintaining
global system tables required for system operation and for
preventing data corruption caused by, say, an input/output process
pair (IOP) of primary and backup processes on different processors
accessing the same I/O device through dual-ported I/O controllers or
a shared bus (such as SCSI).

Detection of processor failures occurs quickly with an IamAlive
message scheme. Each processor periodically sends IamAlive packets
to each of the other processors in the system. Each processor in a
system determines whether another processor is operational by timing
packets from it. When the time interval passes without receipt of a
packet from a given processor, the first processor decides that the
second might have failed.

In older systems, before regrouping was implemented, the following
could occur when the second processor then sent a packet to the
first. The first processor judged the second to be functioning
improperly and responded with a poison packet. The first processor
ignored the content of the packet from the second.

Ultimately, many or all of the other processors could end up
ignoring the affected processor (except to try to stop it). The
affected processor was, in effect, outside of the system and
functioning as if it were an independent system. This condition was
sometimes called the split-brain problem.

Without regrouping, the following situations can occur: Both of the
processes in a process pair running on different processors can
regard themselves as the primary, destroying the ability to perform
backup functions and possibly corrupting files. All system
processors can become trapped in infinite loops, contending for
common resources. System tables can become corrupted.

Regrouping supplements the IamAlive/poison packet method. Regrouping
uses a voting algorithm to determine the true state of each
processor in the system. Each processor volunteers its record of the
state of all other processors, compares its record with records from
other processors and updates its record accordingly. When the voting
is complete, all processors have the same record of the system's
state. The processors will have coordinated among themselves to
reintegrate functional but previously isolated processors and to
correctly identify and isolate nonfunctional processors.

Regrouping works only when physical communication among processors
remains possible, regardless of the logical state of the processors.
If a processor loses all of its communications paths with other
processors, that processor cannot be regrouped. It remains isolated
until communications are restored and the system is cold loaded.
(Such a processor usually stops itself because its self-checking
code cannot send and receive message system packets to and from
itself.)

A processor's logical state and its condition are distinguished. A
processor has two logical states in a properly configured system: up
or down. However, a processor has three conditions: dead, which is
the same as the down logical state; healthy, which is the same as
the up logical state; and malatose, which is described further
below.

A processor is dead if it does not communicate with the rest of the
system. Dead processors include those, for example, that execute a
HALT or a system freeze instruction, that encounter low-level
self-check errors such as internal register parity errors, that
execute infinite loops with all interrupts disabled, that execute
non-terminating instructions due to data corruption or that are in a
reset state.

Dead processors are harmless, but the regrouping algorithm removes
them from the system configuration. Other processors detect dead
processors and declare them down.

A processor is healthy if it is running its operating system
(preferably, the NonStop Kernel® operating system available from the
assignee of the instant application) and can exchange packets with
other processors (preferably, over a redundant high-speed bus or
switching fabric) within a reasonable time. The regrouping algorithm
prevents a processor declaring down a healthy processor.

A malatose processor is neither dead nor healthy. Such a processor
either is not responding in a timely manner (perhaps because of
missing timer ticks) or is temporarily frozen in some low-level
activity. A malatose processor might be, for example, flooded with
highest-priority interrupts such that the processor cannot take
lower-priority
interrupts or might be flooded with lower-priority interrupts such
that the processor falls behind in issuing IamAlive packets. A
malatose processor might be waiting for a faulty hardware device on
which the clocks have stopped or might be running too long with
interrupts disabled by the mutual exclusion mechanism.

The regrouping algorithm detects a malatose processor and forces it
to become either healthy or dead, that is to say, either up or down.
Correspondingly, a processor halts itself when another processor
that it has not declared down declares it down.

With regard to regrouping, each processor in the system is either
stable (that is, waiting for the need to act) or perturbed,
including several states described below.

While a processor is stable, the IamAlive message scheme continues
to operate. If a predetermined amount of time, say, 2.4 seconds,
passes without an IamAlive message from another processor, the
processor becomes perturbed.

While perturbed, a processor exchanges specially marked packets with
other perturbed processors to determine the current processor
configuration of the system. When that configuration is agreed upon,
the processor becomes stable again.

Processors spend most of their time stable.

A regrouping incident begins when a processor becomes perturbed and
ends when all processors become stable again. Each regrouping
incident has a sequence number that is the number of regrouping
incidents since the last system cold load.

Each processor also maintains variables to store two configurations,
one old and one new. While a processor is stable, bit-map variables
called OUTER_SCREEN and INNER_SCREEN both contain the old
configuration.

While a processor is stable, it knows that every processor in the
old configuration is up and every processor not in the old
configuration is down. Each processor in the old configuration has
the same regrouping sequence number.

While a processor is perturbed, it broadcasts its view of the
configuration (and its own status) on its busses or fabrics. It
sends this view periodically, for example, every 0.3 seconds, to all
other processors in the old configuration. Receiving such a
broadcast perturbs any stable processor in the configuration.

The four stages of the regrouping protocol described further below
make all perturbed processors create the same view of the system
configuration. When regrouping completes, all processors in the
system are stable and contain the same new configuration. Also,
every processor in the new configuration has the same regroup
sequence number that is greater than the number in the old
configuration.

The new configuration contains no processor that was not in the old
configuration. All processors that remained healthy throughout the
incident are in the new configuration.

Any processor that was dead when the incident began or that became
dead during the incident is not in the new configuration. Regrouping
restarts if a processor becomes dead during an incident.

Correspondingly, processors that were malatose when the incident
began are in the new configuration as healthy processors if they
participated in the complete incident.

The regrouping method ensures that all processors in the new
configuration have included and excluded the same processors.

Processor Stages of Pre-Existing Regroup

Each processor regrouping according to the pre-existing algorithm
maintains an EVENT_HANDLER() procedure and a data structure herein
termed the regroup control template #_700 shown in FIG. #_7. A
variable herein termed SEQUENCE_NUMBER contains the current regroup
sequence number.

Each processor passes through the following stages while running:
Stage 0, Stage 5 and Stages 1 through 4. Stage 0 is a special mode
defined in the process control block at system generation. Stage 5
is the stable state described above. Stages 1 through 4 together
make up the perturbed state also described above.

A processor maintains the current stage in the variable STAGE. Also,
the processor maintains the variables KNOWN_STAGE_1 through
KNOWN_STAGE_4 for each of Stages 1 through 4, respectively. Each of
these variables is a bit mask that records the processor numbers of
all processors known to the maintaining processor to be
participating in a regroup incident in the stage corresponding to
the variable.

A processor enters Stage 0 when it is cold loaded. While it is in
Stage 0, the processor does not participate in any regrouping
incident. Any attempt to perturb the processor in this state halts
the processor. The processor remains in Stage 0 until its
integration into the inter-process and inter-processor message
system is complete. Then the processor enters Stage 5. FIGS. #_8A
and #_8B summarize subsequent actions.

A regrouping incident normally begins when a processor fails to send
an IamAlive packet in time, step #_810. This failure perturbs the
processor that detects the failure.

When a processor is perturbed, step #_805, it enters Stage 1. Stage
1 synchronizes all participating processors as part of the same
regrouping incident, step #_830. Because a new incident can start
before an older one is finished, a method is needed to ensure that
the participating processors process only the latest incident.

FIG. #_9 summarizes the transition from Stage 5 to Stage 1. The
processor increments the SEQUENCE_NUMBER #_710, sets the Stage
#_720 to 1, sets the KNOWN_STAGE_n variables to zero, and then sets
its own bit in KNOWN_STAGE_1 #_750a to 1. (The processor does not
yet know which processors other than itself are healthy.)

The message system awakens the processor periodically, every 0.3
seconds in one embodiment, so the processor can make three to six
attempts to receive acceptable input. More than three attempts occur
if more than one processor in the old configuration remains
unrecognized, if a power up has occurred, or if the algorithm was
restarted as a new incident.

When awakened, the processor broadcasts its status to the old
configuration of processors, step #_830. Its status includes its
regroup control template #_700.

Typically, status packets from other perturbed processors eventually
arrive. If a packet arrives from a processor that was not in the old
configuration as defined by the OUTER_SCREEN #_730, this processor
ignores the packet and responds with a poison packet.

For a packet that it does not ignore, the processor compares the
sequence number in the packet with the SEQUENCE_NUMBER #_710. If the
packet sequence number is lower, then the sender is not
participating in the current incident. Other data in the packet is
not current and is ignored. The processor sends a new status packet
to that processor to synchronize it to make it participate in the
current incident.

If the sequence number in the packet is higher than the
SEQUENCE_NUMBER #_710, then a new incident has
started. The SEQUENCE_NUMBER #_710 is set to the sequence number in
the packet. The processor reinitializes its data structures and
accepts the rest of the packet data.

If the sequence number in the packet is the same as the
SEQUENCE_NUMBER #_710, then the processor simply accepts the packet
data. Accepting the data consists of logically ORing the
KNOWN_STAGE_n fields in the packet with the corresponding processor
variables #_750 to merge the two processors' knowledge into one
configuration.

Stage 1 ends in either of two ways. First, all processors account
for themselves. That is to say, when a processor notices that its
KNOWN_STAGE_1 variable #_750a includes all processors previously
known (that is, equals the OUTER_SCREEN #_730), then the processor
goes to Stage 2.

However, in the event of processor failure(s), the processors never
all account for themselves. Therefore, Stage 1 ends on a time out.
The time limit is different for cautious and non-cautious modes, but
the processor proceeds to Stage 2 when that time expires whether all
processors have accounted for themselves or not.

FIG. #_10 summarizes the transition from the beginning of Stage 1 to
the end of Stage 1. At the end of Stage 1, KNOWN_STAGE_1 #_750a
identifies those processors that this processor recognizes as valid
processors with which to communicate during the current incident. In
later stages, the processor accepts packets only from recognized
processors.

Stage 2 builds the new configuration by adding to the set of
processors recognized by the processor all of those processors
recognized by recognized processors, step #_850. In effect, the new
configuration is a consensus among communicating peers.

FIG. summarizes conditions at the beginning of Stage 2. The
processor sets the Stage #_720 to 2, records its status in
KNOWN_STAGE_2, and copies KNOWN_STAGE_1 to the INNER_SCREEN #_740.
The processor continues checking for input and broadcasting status
periodically, testing incoming packets for acceptance against the
OUTER_SCREEN and INNER_SCREEN #_730, #_740.

Packets from old-configuration processors that did not participate
in Stage 1 are identified by the INNER_SCREEN #_740 and ignored.
Packets from recognized processors are accepted, and their
configuration data is merged into the KNOWN_STAGE_n variables. When
a packet from a recognized processor identifies a previously
unrecognized processor, the new processor is also added to the
INNER_SCREEN #_740. Malatose processors that may have been too slow
to join the current regroup incident in Stage 1 can thus still join
in Stage 2.

When KNOWN_STAGE_2 #_750b becomes equal to KNOWN_STAGE_1 #_750a, no
further changes to the configuration can occur. FIG. #_12 summarizes
conditions at the end of Stage 2. Stage 3 now begins.

At the beginning of Stage 3, as shown in FIG. #_13, the processor
increments the Stage #_720 and copies the new configuration to both
the INNER_SCREEN and the OUTER_SCREEN #_740, #_730. A malatose
processor can no longer join the new configuration as a healthy
processor.

Message-system cleanup, step #_860, is performed as follows: The
processors in the new configuration shut off the message system to
any processor not in the new configuration. They discard any
outstanding transmissions to any excluded processor and discard any
incoming transmissions from it. Inter-processor traffic queues are
searched for messages queued from requesters/linkers in the excluded
processor but not canceled. Any uncanceled messages found are
discarded. Inter-processor traffic queues are searched for messages
queued from servers/listeners in the excluded processor but not
canceled. Any uncanceled messages found are attached to a deferred
cancellation queue for processing during Stage 4.

This cleanup ensures that no message exchanges begun by a
server/listener application in a processor in the new configuration
remain unresolved because of exclusion of the other processor from
the new configuration. All messages that could be sent to the
excluded processor have been sent; and all messages that could be
received from it have been received.

Most processor functions occur as bus or timer interrupt handler
actions. Because some cleanup activities take a long time, they
cannot be done with interrupts disabled. Instead, those activities
are separated from others for the same stage and deferred.

The deferred cleanup is done through a message-system
SEND_QUEUED_MESSAGES procedure that is invoked by the dispatcher
(the process scheduler). The deferred activities are then performed
with interrupts enabled most of the time.

Periodic checking for input and the broadcasting of status
continues. When the deferred cleanup mentioned earlier finishes, the
processor records its status in KNOWN_STAGE_3 #_750c.

Packets that make it past the INNER_SCREEN and the OUTER_SCREEN
#_740, #_730 are merged into the KNOWN_STAGE_n variables #_750. When
KNOWN_STAGE_3 #_750c equals KNOWN_STAGE_2 #_750b, all processors in
the new configuration have completed similar cleanup and are all in
Stage 3. FIG. #_14 summarizes conditions at the end of Stage 3.

In Stage 4, the processor completes the cleanup actions of Stage 3
and notifies processes that one or more processor failures have
occurred, step #_870. The processor increments the Stage #_720 to 4
and does the following: sets processor-status variables to show
excluded processors in the down state; changes the locker processor,
if necessary, for use in the GLUP protocol as described herein;
processes messages deferred from Stage 3; manipulates I/O controller
tables when necessary to acquire ownership; and notifies
requesters/linkers.

Stage 4 is the first point at which failure of another processor can
be known by message-system users in the current processor. This
delay prevents other processes from beginning activities that might
produce incorrect results because of uncanceled message exchanges
with the failed processor.

The regrouping processor continues to check for input and to
broadcast status, step #_870. When the deferred cleanup finishes,
the processor records its status in KNOWN_STAGE_4 #_750d. FIG. #_15
shows this action.

Packets that make it past the INNER_SCREEN and the OUTER_SCREEN
#_740, #_730 are merged into the KNOWN_STAGE_n variables #_750. When
KNOWN_STAGE_4 #_750d equals KNOWN_STAGE_3 #_750c, all processors in
the new configuration have completed similar cleanup and are all in
Stage 4. FIG. #_16 summarizes conditions at the end of Stage 4.

At the beginning of Stage 5, the Stage #_720 becomes 5. One final
broadcast and update occur. The OUTER_
SCREEN #_730 contains what has now become the old
configuration for the next regrouping incident. FIG. #_17 
shows this situation. 

Finally, higher-level operating system cleanup can now 
begin. Global update recovery starts in the locker processor. 

The processor does its own cleanup processing. Attempts 
to restart the failed processor can now begin. 

Stopping and Restarting an Incident 

A processor must complete Stages 2 through 4 within a 
predetermined time, 3 seconds in one embodiment. If it does 
not complete those stages within that time, some other 
processor has probably failed during the regrouping. 
Therefore, the incident stops and a new incident starts with 
the processor returning to the beginning of Stage 1. Any 
cleanup that remains incomplete at the restart completes 
during the stages of the new incident. Cleanup actions either 
have no sequencing requirements or have explicitly con- 
trolled sequences so that they are unaffected by a restart of 
the algorithm. 
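
The restart rule above can be sketched in Python as a pure function. This is a minimal illustration under stated assumptions: the function name, its parameters and the use of plain floating-point seconds are hypothetical, and only the 3-second limit comes from the text.

```python
# "3 seconds in one embodiment"; everything else here is an assumed model.
STAGE_2_TO_4_LIMIT_S = 3.0


def next_stage(stage: int, stage_2_started_at: float, now: float,
               stage_complete: bool) -> int:
    """Return the stage a processor in Stage 2, 3 or 4 should be in next."""
    if stage not in (2, 3, 4):
        raise ValueError("the time limit applies only to Stages 2 through 4")
    if now - stage_2_started_at > STAGE_2_TO_4_LIMIT_S:
        # Time limit exceeded: stop this incident, restart at Stage 1.
        return 1
    # Otherwise advance only when the current stage has finished.
    return stage + 1 if stage_complete else stage
```

The sketch checks the deadline before the completion flag, so a stage that finishes late still triggers a restart. That ordering is a design choice of the example and not something the text specifies.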

During the restart, the INNER_SCREEN and the
OUTER_SCREEN #_740, #_730 are not reinitialized. By
not changing these variables, the processor continues to
exclude from the new configuration any processors that have
already been diagnosed as not healthy. Processors known to
be dead are excluded by the OUTER_SCREEN #_730.
Processors previously recognized as healthy are the only
ones with which the INNER_SCREEN #_740 permits the
processor to communicate.

The processor accepts status only from recognized pro- 
cessors. Therefore, only a recognized processor can add 
another processor to the configuration before the end of 
Stage 2. As Stage 2 ends and Stage 3 begins, the regrouping 
processors exclude the failing processor that caused the 
restart from the new configuration when the KNOWN_ 
STAGE_2 #_750b is copied to the OUTER_SCREEN and 
INNER_SCREEN #_730, #_740. After Stage 2 ends, the
configuration does not change until a new incident starts. 

Power Failure and Recovery Regrouping 

When a processor is powered up, it causes a new incident 
to start. A word in a broadcast status packet indicates that a 
power failure occurred so that receiving processors can clear 
bus error counters and refrain from shutting down the 
repowered processor's access to the busses or fabric. 
Depending on the characteristics of the interprocessor com- 
munications hardware (busses or fabrics), errors are more 
likely just after a power outage when components are 
powering on at slightly different times. 

Effects of Inter-Processor Communications Path 
Failures 

The effect on regrouping of a failure of inter-processor 
communications paths (IPCPs) depends on whether the 
failure is transient or permanent. A transient failure is one 
that allows occasional use of the IPCPs to transmit packets. 
A permanent failure is one that prevents any packet from 
passing through that component until the component is 
replaced. 

Transient IPCP failures during Stage 1 normally do not 
affect regrouping. More than one attempt is made to transmit 
a status packet, and redundant communications paths are 
used for each packet. Transmission is almost always
successful. If transmission on the redundant paths does fail,
either the algorithm restarts or the processor stops. 

A successfully transmitted packet can be received as one
of three types: unique, because a transient IPCP failure
occurred and the other copy of the packet could not be sent;
duplicated, because it was received over redundant IPCPs;
or obsolete, because a processor transmitted a status packet,
had its status change, and then transmitted a new status
packet, but one or more paths delivered the status packets
out of order.

The regroup control template variables are updated by 
setting bits to 1 but never by setting them to 0. Duplicated,
obsolete, or lost packets do not change the accuracy of the 
new configuration because a bit is not cleared by subsequent 
updates until a new incident starts. No harm follows from 
receiving packets out of order. 
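
Why set-only updates tolerate duplicated and out-of-order packets can be seen in a small Python sketch. The function name and the sample bit masks are hypothetical. The only property taken from the text is that an update sets bits and never clears them, which is a bitwise OR.

```python
from functools import reduce
from itertools import permutations


def merge_packets(initial_mask: int, packet_masks) -> int:
    """OR every received mask into the local one; bits are never cleared."""
    return reduce(lambda known, mask: known | mask, packet_masks, initial_mask)


# Assumed sample traffic: 0b0001 is an obsolete view from a sender whose
# newer view is 0b0011, and 0b0100 arrives twice over redundant paths.
received = [0b0001, 0b0011, 0b0100, 0b0100]
final_mask = merge_packets(0, received)

# Every arrival order, duplicates included, yields the same final mask.
all_orders_agree = all(
    merge_packets(0, order) == final_mask for order in permutations(received)
)
```

Because OR is commutative, associative and idempotent, the result depends only on which bits were ever reported, not on the order or the number of deliveries.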

The handling of permanent IPCP failures differs.

When a processor cannot communicate with itself over at 
least one path, that processor halts with an error. This action 
means that when all redundant IPCPs fail, the system halts
all processors automatically. Regrouping becomes
irrelevant.

Failure of an IPCP element or IPCP-access element does 
not affect regrouping as long as one two-way communica- 
tion path remains between two processors. A processor that 
cannot communicate with at least one other processor halts
itself through the monitoring function of the regrouping 
processor. 

A processor that can communicate with at least one other 
processor is included in the new configuration because the
new configuration is achieved by consensus. When each
processor receives a status packet, it adds the reported
configuration to update its own status records. This
combined configuration is automatically forwarded to the next
processor to receive a status packet from the updating
processor.

For example, consider the following situation: Given 
redundant IPCPs X and Y, processors 0 and 2 can send only
on IPCP X and receive only on IPCP Y. Processor 1, on the
other hand, can receive only on IPCP X and send only on
IPCP Y. Thus, processors 0 and 2 have a communication
path with processor 1. Eventually, all three processors will 
have the same new configuration. The processor status 
information from both processors 0 and 2 will have been 
relayed through processor 1. 
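
The relay effect in this example can be reproduced with a short Python simulation. It is a sketch under simplifying assumptions: broadcasts happen in synchronous rounds, each processor starts out knowing only itself, and all names (`TX`, `RX`, `run_rounds`) are hypothetical. The send and receive capabilities are the ones given in the example.

```python
# Paths each processor can send on and receive on, as in the example.
TX = {0: {"X"}, 1: {"Y"}, 2: {"X"}}
RX = {0: {"Y"}, 1: {"X"}, 2: {"Y"}}


def run_rounds(tx, rx, max_rounds=10):
    """Broadcast and OR-merge views until every processor holds the same mask."""
    known = {p: 1 << p for p in tx}  # each processor initially knows itself
    for round_no in range(1, max_rounds + 1):
        snapshot = dict(known)  # views as they stood when the round began
        for sender in tx:
            for receiver in tx:
                # A packet arrives if some path is usable by both ends.
                if sender != receiver and tx[sender] & rx[receiver]:
                    known[receiver] |= snapshot[sender]
        if len(set(known.values())) == 1:
            return known, round_no
    return known, None


views, rounds_needed = run_rounds(TX, RX)
```

Processors 0 and 2 never exchange a packet directly in this model, yet each learns of the other from the merged view that processor 1 rebroadcasts.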


Unresolved Failure Scenarios 

The pre-existing regroup algorithm works well for processor
failures and malatose processors. There are, however,
certain communications failure scenarios for which it does
not work well. In understanding these scenarios, conceive of
a working multi-processing system (such as a NonStop
Kernel® system) logically as a connected graph in which a
vertex represents a functioning processor and an edge
represents the ability for two processors to communicate
directly with each other. For a system to operate normally,
the graph must be fully connected, i.e., all processors can
communicate directly with all other processors. A logical
connection must exist between every pair of processors.

(The graph is a logical interconnection model. The physical
interconnect can be a variety of different topologies,
including a shared bus in which different physical intercon- 
nections do not exist between every pair of processors.) 
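
The full-connection requirement on the logical graph can be checked with a few lines of Python. This is an illustrative sketch only: the function names and the representation of logical links as unordered pairs of processor numbers are assumptions, not part of the described system.

```python
from itertools import combinations


def missing_links(processors, links):
    """Return the processor pairs that lack a direct logical connection."""
    linked = {frozenset(link) for link in links}
    return [
        pair
        for pair in combinations(sorted(processors), 2)
        if frozenset(pair) not in linked
    ]


def is_fully_connected(processors, links):
    """True when every pair of processors can communicate directly."""
    return not missing_links(processors, links)
```

For example, with processors 0, 1 and 2 and links 0-1 and 1-2 only, `missing_links` reports the pair (0, 2), so the graph is connected but not fully connected.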

In the first scenario, two processors in the system come to
have inconsistent views of the processors operating in the
system. They disagree about the set of vertices composing
the graph of the system. A "split brain" situation is said to
have occurred. This split-brain situation can lead each of the
primary and backup of an I/O process pair that resides across the
split brain to believe that it is the primary process, with data
corruption as a result.

Generally, split-brain situations can occur if communication
failures break up a system into two or more distinct clusters of
processors, which are cut off from one another. The connectivity
graph of the system then breaks into two or more disjoint connected
graphs.

In the second scenario, communication failures result in the
connectivity graph becoming only partially connected. This happens
when communication between a pair of processors fails completely in
spite of redundant paths. When one of the processors notices that it
has not received IamAlive messages from the other for a certain
period, it activates a regroup operation. If, however, there is a
third processor with which the two can communicate, the pre-existing
regroup operation decides that all processors are healthy and
terminates without taking any action. A message originating on
either of the processors and destined to the other processor hangs
forever: Both processors are healthy, and a fault-tolerant message
system guarantees that messages will be delivered unless the
destination processor or process is down. Until a regroup operation
declares the destination processor down, the message system keeps
retrying the message but makes no progress since there is no
communication path between the processors.

In this second scenario, the whole system can hang due to one or
more of the following circumstances: The global update (GLUP)
protocol (described in U.S. Pat. No. 4,718,002 (1988), incorporated
herein by reference) that is used for updating the replicated kernel
tables assumes that a processor can communicate with all healthy
processors in the system. If GLUP starts on a processor that cannot
communicate with one of the healthy processors, the GLUP protocol
hangs in the whole system, preventing the completion of activities
such as named process creation and deletion. A system may also hang
if a critical system process hangs waiting for the completion of a
hung message.

Such system hangs could lead to processors halting due to the
message system running out of resources.

Where the inter-processor communication path is fault-tolerant
(e.g., dual buses) while the processors are fail-fast

healthy. The processors quickly dub the regroup incident a false
start and declare no processors down. A new regroup incident
nonetheless starts the next time a processor detects the missing
IamAlives. Thus, the system goes through periodic regroup events at
the IamAlive-checking frequency (e.g., once per 2.4 seconds), which
terminate almost immediately without detecting the failure.

Accordingly, there is a need for a multi-processor regroup operation
that avoids these split-brain, partial-connection and timer-failure
scenarios.

A goal of the present invention is a multi-processor computer system
wherein the constituent processors maintain a consistent image of
the processors composing the system.

Yet another goal of the present invention is a multi-processor
computer system wherein the constituent processors are fully
connected when the system is stable.

Yet another object of the present invention is a multi-processor
computer system wherein the failure of the processor to receive
timer expirations is detected and the processor declared down.

Another object of the present invention is such a multi-processor
system, where said processors are maximally fully connected when the
system is stable.

An object of the invention is such a multi-processor system, where
the system resources (particularly, processors) that may be needed
for meeting integrity and connectivity requirements are minimally
excluded.

Another goal of the invention is such a multi-processor system
where, when regrouping, the system takes into account any
momentarily unresponsive processor.

These and other goals of the invention will be readily apparent to
one of ordinary skill in the art on the reading of the background
above and the description following.

SUMMARY OF THE INVENTION

Herein is disclosed a method and apparatus for tolerating the loss
of timer ticks in a multi-processor computer system. The
multi-processor system includes multiple processors, each having a
respective memory. The method and apparatus include subjecting each
of the multiple processors to a method including respective
advancement from a first to a second stage, initially placing each
processor in the first

(e.g., single fault-detecting processors or lock-stepped pro- 45 stage; then sending status of advancement of a second of the 

cessors running the same code stream, where a processor multiple processors. A processor receives the status of 

halts immediately upon detecting a self-fault), the likelihood advancement of the second processor and updates its status 

of communication breakdown between a pair of processors only if notification of a time expiration has occurred on the 

becomes far less likely than the failure of a processor. receiving processor. Each processor which has updated its 

However, a software policy of downing single paths due to 50 status advances to the second stage. The determination that 

errors increases the probability of this scenario. timer expirations have failed on a processor occurs when the 

Further, with the introduction of complex cluster multi- processor fails to advance from the first stage, 

processor topologies, connectivity failure scenarios seem BRIEF DESCRIPTION OF THE DRAWINGS 

more likely. These could be the result of failures of routers, . , . , 

defects in the system software, operator errors, etc. 55 ^JJ^J-J sun P llfied block dia 5 ram of a mul,1 P le 

In the third scenario, a processor becomes unable to send processing system, . 

the periodic lamAlive messages but nonetheless can receive FIG # - 2 15 a g ra P h ^presenting a five-processor multi- 

and send inter-processor communication messages. (Such a processor system, 

situation results from, for example, corruption of the time nG - # -3 B a gfaP h representing a two-processor multi- 
list preventing the reporting of timer expirations to the 60 P™ 065501 system; 

operating system.) One of the other processor readily detects F* G * 4 is the graph of FIG. #_2, subjected to com- 

this failure of the processor and starts a regroup incident. munications faults; 

However, since the apparently malatose processor can FIG. #_5 the graph of FIG. #_3, subjected to commu- 

receive the regroup packets and can broadcast regroup nications faults; 

packets, the faulty processor fully participates in the regroup 65 FIG. #_6 is a flow diagram illustrating Stage I of the 

incident. This participation is sufficient to convince the other regroup operation according to one embodiment of the 

processors that the apparently malatose processor is in fact invention; 
FIG. #_7 is a diagram of the regroup control template;

FIGS. #_8A and #_8B summarize the steps of a regroup operation;

FIG. #_9 summarizes the transition from Stage 5 to Stage 1 according to
one embodiment of the invention;

FIG. #_10 summarizes the transition from the beginning of Stage 1 to
the end of Stage 1 according to one embodiment of the invention;

FIG. #_11 summarizes conditions at the beginning of Stage 2 according
to one embodiment of the invention;

FIG. #_12 summarizes conditions at the end of Stage 2 according to one
embodiment of the invention;

FIG. #_13 shows the status at the beginning of Stage 3 according to one
embodiment of the invention;

FIG. #_14 summarizes conditions at the end of Stage 3 according to one
embodiment of the invention;

FIG. #_15 shows the status at the beginning of Stage 4 according to one
embodiment of the invention;

FIG. #_16 summarizes conditions at the end of Stage 4 according to one
embodiment of the invention;

FIG. #_17 shows conditions at the beginning of Stage 5 according to one
embodiment of the invention; and

FIGS. #_18A and #_18B are flow diagrams illustrating an embodiment of
the split brain avoidance protocol according to one embodiment of the
invention.

DESCRIPTION OF THE PREFERRED 
EMBODIMENT 
canonical matrix: a connectivity matrix C is in canonical
form if and only if:

(1) if a processor i is dead, the row C(i,x) is FALSE, and
the column C(x,i) is FALSE; and

(2) if C(i,j) is FALSE, C(j,i) is FALSE. This ensures
symmetric or bidirectional connectivity.

connected graph: a graph in which no processor is isolated
from all other processors in the graph.

connectivity matrix: an NxN matrix C such that:
N is the number of processors;

each processor is uniquely numbered between 1 and N (or
between 0 and N-1 if zero indexing is used);
C(i,i) is TRUE if processor i is healthy;
C(i,i) is FALSE if processor i is dead or non-existent;
C(i,j) is TRUE if processor i is connected to processor j
and i≠j; and

C(i,j) is FALSE if processor i is not connected to
processor j and i≠j.

disconnect: in a graph, the lack of an edge between two
processors; a "missing" edge in a graph; a pair of processors
between which there is no edge; a pair (i,j) such that C(i,j)
is FALSE or C(j,i) is FALSE.

fully connected graph: a graph in which each processor 
has an edge with all other processors. 

graph: a representation of the processors within a multi- 
processor system and of the communication links among 
those processors. The vertices of the graphs are the 
processors, and the edges are the communication links. The 
edges are bi-directional. 

The terms "vertex" and "processor" are used
interchangeably, as are the terms "communication
link," "link" and "edge."
(Redundant links between a pair of processors are considered
together as one link. In this embodiment, the
communication network is ServerNet®, available from
the assignee of the instant application, and the
communication links are ServerNet® paths. A ServerNet®
path is a sequence of ServerNet® links and routers.)

group: a proper subset of the processors in a multi-processor
system. The subset of processors is interconnected
communicatively. When a fully connected multi-processor
system breaks into groups, the groups are disjoint and may
not be fully interconnected.

maximal, fully connected subgraph: a fully connected 
subgraph that is not a proper subset of another fully con- 
nected subgraph of the same graph. 
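
The connectivity-matrix definitions above can be made concrete with a small sketch. It is an illustration only, not code from the patent: the function names (`make_matrix`, `canonicalize`, `is_canonical`, `disconnects`) are hypothetical, processors are zero-indexed, and the sketch assumes that disconnects are only of interest between live processors.

```python
from itertools import combinations


def make_matrix(n, healthy, links):
    """Build an N x N connectivity matrix (zero-indexed).

    healthy: set of live processor indices, which sets C(i,i) to TRUE.
    links:   iterable of directed (i, j) pairs, which sets C(i,j) to TRUE.
    """
    C = [[False] * n for _ in range(n)]
    for i in healthy:
        C[i][i] = True
    for i, j in links:
        C[i][j] = True
    return C


def canonicalize(C):
    """Return a canonical copy of C.

    Rule (1): a dead processor gets an all-FALSE row and column.
    Rule (2): a one-way link is dropped, so C(i,j) equals C(j,i).
    """
    n = len(C)
    out = [row[:] for row in C]
    for i in range(n):
        if not C[i][i]:
            for x in range(n):
                out[i][x] = out[x][i] = False
    for i, j in combinations(range(n), 2):
        both = out[i][j] and out[j][i]
        out[i][j] = out[j][i] = both
    return out


def is_canonical(C):
    """True if C already satisfies both canonical-form rules."""
    return C == canonicalize(C)


def disconnects(C):
    """List pairs (i, j), i < j, of live processors with a missing edge."""
    live = [i for i in range(len(C)) if C[i][i]]
    return [(i, j) for i, j in combinations(live, 2)
            if not (C[i][j] and C[j][i])]
```

For a five-processor system in which the links between processors 2 and 3 and between processors 2 and 4 are lost (indices 1, 2 and 3 when zero-indexed), `disconnects` returns `[(1, 2), (1, 3)]`.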

Overview 




The multi-processor systems of the invention may be constructed using
the teachings of U.S. Pat. No. 4,817,091, issued Mar. 28, 1989
(Attorney Docket No. 010577-49-3-1) and U.S. patent application Ser.
No. 08/486,217, entitled "Fail-Fast, Fail-Functional, Fault-Tolerant
Multi-processor System," filed Jun. 7, 1995, naming as inventors
Robert W. Horst, et al., under an obligation of assignment to the
assignee of this invention, with Attorney Docket No.
010577-028210/TA 214-1. Therefore, U.S. Pat. No. 4,817,091 and U.S.
patent application Ser. No. 08/486,217 are incorporated herein by
reference to the extent necessary.

FIG. #_1 is a simplified block diagram of a multi-processor system
incorporating the present invention. The processors #_112 are
interconnected by a network #_114 and connections #_116 that provide
the processors #_112 with interprocessor communication via
transceivers #_117. The network #_114 may be implemented by a standard
communications interconnect such as an Ethernet LAN or by a bus system
that interconnects processors #_112, in parallel, and is independent
from any input/output (I/O) system that the processors may have, such
as is taught by U.S. Pat. No. 4,817,091, mentioned above.
Alternatively, the network #_114 could be implemented as part of a
joint I/O system that provides the processors #_112 not only with
access to various I/O units (e.g., printers, secondary storage, and
the like, not shown) but also provides communication paths for
interprocessor communication for the processors #_112. The network
#_114 can also be any point-to-point network such as rings,
fully-connected stars and trees.

Internal to or otherwise associated with each of the processors #_112
is a memory #_118 that is independent from the memory #_118 of the
other processors #_112 and
a time-of-day clock (not shown) independent of the time-of-day clocks
of the other processors #_112. Also associated with each of the
processors #_112 is a power supply #_120 that receives primary power
(e.g., alternating current, not shown) to supply therefrom the
necessary electrical power (e.g., direct current) for operation of the
associated processor #_112.

In one embodiment, internal to or otherwise associated with each of the
processors #_112 is a configuration option register #_119. The use of
the configuration option register #_119 is taught in U.S. patent
application Ser. No. 08/487,941 entitled, "Method to Improve Tolerance
of Non-Homogeneous Power Outages," naming as inventors Robert L.
Jardine, Richard N. Collins and A. Richard Zacher, under an obligation
of assignment to the assignee of the instant invention, with Attorney
Docket No. 010577-033000/TA 272. U.S. patent application Ser. No.
08/487,941 is incorporated herein by reference.

The network #_114 forms the medium that allows the processors #_112 to
send and receive messages to and from one another to communicate data,
status, and other information therebetween. The medium is preferably a
redundant network with at least two paths between every pair of
processors.

FIG. #_2 is a graph #_200 representing a five-processor multi-processor
system #_200. The graph #_200 of FIG. #_2 is fully connected. Each of
the five processors 1-5 has a communications link with all of the other
processors 1-5.

FIG. #_3 is a graph #_300 representing a two-processor multi-processor
system #_300. The system #_300 of FIG. #_3 is also fully connected. The
two processors 1, 2 are in communication with each other.

Now assume that faults occur that divide the system #_200 into the
graph #_400 of FIG. #_4. In the graph #_400, the group of processors 1,
3, 4 and 5 is fully connected, and the group of processors 1, 2 and 5
is fully connected.

The processors of the graph #_400 all enter a regroup operation on the
detection of the communication failures. According to the present
invention, in order to avoid split-brain problems and to maintain a
fully connected multiprocessor system, the processor 2 halts
operations, while each of the processors 1, 3, 4 and 5 continues
operations.

Similarly, where communications failures divide the system #_300 into
the subgraphs of the processor 1 only and of the processor 2 only of
the system #_500 of FIG. #_5, the processors perform a regroup
operation. According to the present invention, in order to avoid
split-brain problems and to maintain a fully connected multiprocessor
system, the processor 2 halts, while the processor 1 continues
operations.

Data Structures

Described below are the data structures and protocols used in a
preferred embodiment to avoid split-brain, partial connection and
timer-failure according to the invention.

Each processor #_112 in a multi-processor system incorporating the
invention maintains a connectivity matrix C. The connectivity matrix is
used to track the edges in the graph that survive communications
failures. The connectivity matrix is also used to determine the
maximal, fully connected subgraph to survive the communications
failures and to determine whether each processor #_112 is to continue
or halt its operations.

The size of the connectivity matrix C is NxN, where N is the number of
processors #_112 in the multi-processor system. In one embodiment, each
entry in the matrix is a bit, and each processor #_112 is uniquely
numbered between 1 and N. An entry C(i,j) indicates the ability of
processor i to receive a message from processor j. Herein, if the
ability exists, the entry is set to one (or logical TRUE). If the
ability does not exist, the entry is set to zero (or logical FALSE).

An entry C(i,i) is set to TRUE if the processor i is healthy. The entry
C(i,i) is FALSE if the processor i is dead or non-existent. If a
processor does not get Regroup messages from itself, it halts.

An entry C(i,j) is set to TRUE if the processor i is communicatively
connected to the processor j (i≠j). The entry C(i,j) is set to FALSE if
the processor i is not communicatively connected to processor j (i≠j).

Each processor #_112 also maintains a node pruning result variable. The
pruning result variable is also a bit-structure, indicating which nodes
of a multi-processor system survive the node pruning protocol described
hereinbelow.

Another data structure is the IamAlive message. In one embodiment, an
IamAlive message contains an identification of the broadcasting
processor #_112, among other information. When successfully
communicated, an IamAlive message indicates to the receiving processor
#_112 the continued operation of the broadcasting processor #_112.

Still another data structure is the Regroup message. A Regroup message
identifies the broadcasting processor #_112 and contains that
processor's connectivity matrix. Thus, a Regroup message contains that
processor's view of the system, including the identification of those
processors #_112 it believes form the system. The Regroup message
includes a pruning result variable and a cautious bit as well.

A multi-processor system according to one embodiment of the invention
maintains a mask of unreachable processors. The mask is N-bit, where N
is the number of processors #_112 in the multiprocessor system, each
entry in the mask is a bit, and each processor #_112 is uniquely
numbered between 1 and N. The maintenance and use of this mask is
explained below.

Protocols

Tie-breaker processor Selection

One of the processors #_112 has a special role in the regroup process
of the invention. This processor #_112 is designated the tie breaker.
As described below, the split-brain avoidance process favors this
processor #_112 in case of ties. Further, the node pruning process
(described below) used to ensure full connectivity between all
surviving processors is run on the tie-breaker processor #_112. This
process also favors the tie breaker in case of large numbers of
connectivity failures.

In one embodiment, the lowest numbered processor #_112 in a group is
selected as the tie breaker. This simple selection process ensures that
all processors #_112 in the group select the same tie breaker.

Regroup and Split-Brain Avoidance

Each of the processors #_112 of a multi-processor system according to
the invention uses the network #_114 for broadcasting IamAlive messages
at periodic intervals. In one embodiment, approximately every 1.2
seconds each of the processors #_112 broadcasts an IamAlive message to
each of the other processors #_112 on each of the redundant paths to
each other processor #_112. Approximately every 2.4 seconds each
processor #_112 checks to see what IamAlive messages it has received
from its companion
processors #_112. When a processor #_112 fails to receive an IamAlive
message from a processor (e.g., #_112b) that it knows to have been a
part of the system at the last check, the checking processor #_112
initiates a regroup operation by broadcasting a Regroup message.

In effect, a regroup operation is a set of chances for the processor
#_112b from which an IamAlive message was not received to convince the
other processors #_112 that it is in fact healthy. Processor #_112b's
failure to properly participate in the regroup operation results in the
remaining processors #_112 ignoring any further message traffic from
the processor #_112b, should it send any. The other processors #_112
ostracize the once-mute processor(s) #_112b from the system.

Stage I

Turning now to FIG. #_6, a flow diagram illustrates Stage I of the
regroup operation, indicated generally with the reference numeral
#_600. Each of the processors #_112 executes Stage I of the regroup
operation. In fact, as the processors #_112 do not necessarily
synchronize their operation, certain processors check for IamAlive
messages earlier than others and enter the regroup operation before the
others.

A processor #_112 may also enter Stage I of the regroup operation even
though it has not detected an absence of any IamAlive messages if it
first receives a Regroup message from a processor #_112 that has
detected the absence of an IamAlive message.

Thus, Stage I begins (steps #_662a or #_662b) when a processor #_112
notes either that a companion processor has failed to transmit its
periodic IamAlive message (step #_662a) or the processor #_112 receives
a Regroup message from another of the processors #_112 (step #_662b).
When a processor #_112 notes either of these occurrences, it commences
Stage I of the regroup operation.

Next, in addition to the actions of Stage I of the pre-existing regroup
operation, the processors #_112 participating in the regroup operation
each start an internal timer (not shown) that will determine the
maximum time for Stage I operation, step #_664. Each processor #_112
also resets its memory-resident connectivity matrix C to all FALSE's
(i.e., C(i,j) is zero for all i,j).

Also at step #_664, each processor #_112 suspends all I/O activity. (In
one embodiment, a service routine holds all subsequent I/O requests in
request queues rather than sending them out on the network #_114.) Only
Regroup messages may flow through the network #_114 during this period.
The processors #_112 resume I/O activity only after the regroup
operation finalizes the set of surviving processors (i.e., after Stage
III).

At step #_666 each of the processors #_112 sends per-processor,
per-redundant-path Regroup messages, containing the processor's view of
the system, including its own identity, a connectivity matrix C, and
the optional cautious bit. (The processors #_112 set and use the
cautious bit according to the teachings of U.S. patent application Ser.
No. 08/265,585 entitled, "Method and Apparatus for Fault-Tolerant
Multi-processing System Recovery from Power Failure or Drop-Outs,"
filed Jun. 23, 1994, naming as inventors Robert L. Jardine, Richard M.
Collins and Larry D. Reeves, under an obligation of assignment to the
assignee of this invention. U.S. patent application Ser. No. 08/265,585
is incorporated herein by reference.) This Regroup message prompts all
other processors #_112 (if they have not already done so on noting the
failure of a processor #_112 to send an IamAlive message) to also enter
the regroup operation.

At step #_668, a processor #_112 examines the Regroup message(s) it has
received and compares the connectivity matrix C contained in the
message(s) with that the processor #_112 maintains in its memory
#_118. If there are differences, the system view maintained in the
memory #_118 is updated accordingly.

In one embodiment, the connectivity matrix in a Regroup message is an
NxN bit matrix. This bit matrix is ORed with an NxN bit matrix that a
processor #_112 receiving the Regroup message maintains in its memory
#_118. Thus, for any processor i marked in any Regroup message as
present, i.e., C(i,i) is set to TRUE in the Regroup message
connectivity matrix, the processor #_112 marks that processor i as
present in the memory-resident matrix, i.e., C(i,i) is set to TRUE in
the memory-resident connectivity matrix.

Thus, the connectivity matrix can include the KNOWN_STAGE_n variables
#_750 described above.

In addition, when a processor i receives a Regroup message from a
processor j (on any path), the processor i sets the C(i,j) entry of its
memory-resident connectivity matrix to TRUE, indicating that processor
i can receive messages from processor j.

As indicated above, two entries exist for the pair of processors i and
j: C(i,j) and C(j,i). The processor i sets the entry C(i,j) to TRUE
when it receives a Regroup message from processor j, while the
processor j sets the entry C(j,i) to TRUE when it receives a Regroup
message from processor i. This dual-entry system allows the
multi-processor system to detect failures that break symmetry, i.e.,
processor i can receive from processor j but processor j cannot receive
from processor i.

Stage I completes when all known processors #_112 are accounted as
healthy, or some predetermined amount of time has passed.

Stage II

The connectivity matrix is used to track the processors known in Stage
I and to determine when the processors known in Stage II are the same
as those from Stage I. In the previously existing regroup operation,
the processors exited Stage II when the processors #_112 participating
in Stage II agree as to the view of the system #_100. In the regroup
operation of the invention, Stage II continues after the processors
agree as to the view of the system.

The connectivity matrix is also used to detect the lack of full
connectivity in the group of processors that survive the initial stages
of the regroup operation. After Stage I and (the beginning of) Stage II
of the regroup operation have determined the set of present processors
in a connected subgraph, each processor applies the split-brain
avoidance methodology described below and illustrated in FIGS. #_18A
and #_18B to ensure that only one subgraph of processors survives. The
methodology involves selecting a tie-breaker processor, step #_1805. A
node-pruning protocol may subsequently be run to select a fully
connected subgraph.

In one embodiment, each processor #_112 selects as the tie-breaker
processor the processor #_112 that (1) was a part of the system at the
end of the last regroup operation to complete (or at system startup, if
no regroup operation has completed) and (2) had the lowest unique
identifying number. All processors #_112 will pick the same tie-breaker
processor #_112.

More loosely, the processors #_112 select as the tie-breaker the
processor #_112 that had the lowest unique identifying number just
before the current regroup operation began. This definition is more
loose in that, as related above, the current regroup operation may have
begun in the middle of an ongoing regroup operation. Thus, all of the
processors
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#_U2 may not agree as to all of the processors #_112 the lowest-numbered surviving processor as a tie breaker for 

known just before the current regroup operation began. the remainder of Stage II, the subsequent stages of the 

In applying the split-brain avoidance methodology of the regroup operation and in post-regroup operation, until 

invention, each processor #_112 makes the following deci- another tie breaker is selected as herein described. All 

S10ns: 5 processors #__112 that survive the application of the split- 

1. If its group has more than one-half of the processors brain avoidance methodology pick the same tie-breaker 
that were present before this regroup operation started, processor #_112. 

as given by the OUTER_SCREEN variable #_740 If me processo ; fe not the tie brea ker, then it ^ m Stage 

described above, then the processor continues n untfl it gels a message from tbe tie-breaker processor 

operations, steps #_1820 and #_1825. 10 #tL2 (or regroup restarts after a stall-detection time-out). 

2. If its group has less than one-half of the processors that jy^ completes the split-brain avoidance protocol. For a 
were present before this regroup operation began, then multi-processor system implementing the split-brain avoid- 
it halts itself immediately, steps #_1810 and #_1815. ance protocol without the node pruning protocol, Stages III 

3. If its group has exactly one-half of the processors that through V complete as described above. However, a system 
were present before this regroup, and its group has at 15 seeking to make itself or maintain itself as a maximally, fully 
least two processors, steps #__1830, then the tie-breaker connected multi-processor completes Stage II and continues, 
processor is used to break the tie as follows. as described below. (Of course, a multi-processor system 
3.1: If its group includes the tie-breaker processor, then can apply the node pruning methodology independently of 

the processor continues operations, steps #_1840 the split-brain avoidance methodology.) 

and #_1825. 20 Regroup and Node Pruning 

3.2: If its group does not have the tie-breaker processor, If the processor is not the tie breaker, then it slays in Stage 

then the processor halts itself immediately, step II until it gets a message from the tie-breaker processor 

#_1850. #_112 or another processor #_LL2 in Stage III with its 

4. If its group has exactly one processor and exactly two pruning result variable set (or regroup restarts after a stall- 
processors existed before this regroup operation began, 25 detection time-out). As scon as a processor #_112 gets such 
then a Stage III packet, it enters Stage III and sets its local 
4.1: If the processor is the tie-breaker processor, then pruning result variable to the value found in the Stage III 

the processor continues operations, steps #_1860 packet it received. 

and #_1865. The tie breaker has additional Stage II responsibilities of 

4.2: If the processor is not the tie-breaker processor, 30 collecting connectivity information, deciding when to stop 

then the processor attempts to survive: The processor collecting the information and pruning the connectivity 

first checks the state of the tie-breaker processor, step graph to determine the final group of processors #_112 that 

#_1870. (In one embodiment, the processor requests survive the regroup operation. 

a service processor (SP) to get the status of the tie In stages I and II, the connectivity information builds up 

breaker. The SP may have independent knowledge 35 on all processors # 112 in their respective memory-resident 

about the status of the tie breaker and may be able to connectivity matrices C as the processors #_J12 exchange 
return that status. The status returned is one of the Regroup messages containing copies of the memory- 
following five values: The processor is halted (or resident matrices C. The tie breaker collects connectivity 
running non-operational code); the processor is in a information along with all the other processors #_112. 
hardware-error (self-check) freeze state; the proces- 40 The tie breaker decides when to stop collecting the 
sor is running NonStop Kernel®; the SP is commu- connectivity information. It gives all processors #_112 a 
nicating with the processor but for some reason reasonable amount of time to send Regroup messages and 
cannot get the processor's status; and the communi- thereby establish connectivity. If the tie breaker were to stop 
cation of the status request failed for some reason.) collecting information too soon, the connectivity graph built 
If the tie breaker has halted or is in a hardware-error 45 might be incomplete, resulting in available processors 
freeze state, then the processor survives, steps #__112 being declared down and pruned out in order to 
#_1880 and #_1865. If the state of the successfully satisfy the full connectivity requirement Incomplete con- 
communicating tie breaker cannot be determined nectivity information does not violate the requirements that 
(e.g., the SP request failing due to an SP connection the final surviving group be consistent on all processors 
failure, the SP replying that it cannot determine the 50 #_112 and fully connected, but it can take out processors 
condition of the NonStop Kernel® tie breaker, or the #_112 that could have been saved, 
multi-processor system not including the equivalent In one embodiment, the tie breaker waits 3 regroup ticks 
of service processors), step #_1890, then the pro- (spaced 300 milliseconds apart) after completing the split- 
cessor checks tbe mask of unreachable processors. If brain methodology (and selecting itself as the tie breaker) 
the tie breaker is not marked unreachable, the pro- 55 before proceeding to apply the node-pruning methodology. 

cessor assumes the tie breaker is malatose and Since each processor # 112 transmits Regroup messages to 

survives, steps #_J.895 and #_1865. If, however, all processors #_112 at each Regroup tick and whenever its 

the tie breaker is marked unreachable, the processor regroup stage changes, this three-tick delay allows each 

assumes that the tie breaker is healthy and applying processor #_112 at least four chances to send messages 

this methodology. It halts operations, steps #_J895 60 containing connectivity information: once when Stage I is 

and #_1897. entered, once when Stage II is entered, and twice more while 

This split-brain avoidance methodology could lead a the tie breaker waits. In addition, messages are sent on all 

processor #_112 to halt itself. Indeed, even the tie-breaker redundant paths. 

processor #_112 may halt itself. Therefore, if the processor Thus, the tie breaker stops collecting connectivity infor- 

#_U2 survives the application of the split-brain avoidance 65 mation when the first of the following two events occurs: (1) 

methodology, it again selects a tie-breaker processor #„112. its memory-resident connectivity matrix C indicates that all 

In a preferred embodiment, each processor # „J12 selects paths are up (i.e., there is full connectivity) or (2a) a 
predetermined number of regroup ticks have elapsed since
the completion of the application of the split-brain avoidance
methodology or (2b) for multi-processor systems not
implementing the split-brain avoidance protocol, a predetermined
number of regroup ticks have elapsed since the
determination that all Stage I processors have entered Stage
II.
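
The stop rule above can be sketched in C. This is an illustrative sketch only, not the implementation of the invention: the names fully_connected, stop_collecting, MAX_PROCS and tick_limit are hypothetical, and the sketch assumes that "all paths are up" is judged only among processors whose own entry C(i,i) is TRUE.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 16   /* assumed system size for this sketch */

/* True when every pair of live processors reports a working path.
   c[i][i] is the liveness flag of processor i, as in the text. */
static bool fully_connected(bool c[MAX_PROCS][MAX_PROCS], int n)
{
    assert(n >= 0 && n <= MAX_PROCS);
    for (int i = 0; i < n; i++) {
        if (!c[i][i])
            continue;                      /* skip dead processors */
        for (int j = 0; j < n; j++)
            if (c[j][j] && !c[i][j])
                return false;              /* a live pair is disconnected */
    }
    return true;
}

/* Tie breaker's stop rule: full connectivity, or the tick budget is spent. */
static bool stop_collecting(bool c[MAX_PROCS][MAX_PROCS], int n,
                            int ticks_elapsed, int tick_limit)
{
    return fully_connected(c, n) || ticks_elapsed >= tick_limit;
}
```

With tick_limit set to 3, the second condition corresponds to the three-tick wait described for one embodiment.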

After the tie-breaker processor #_112 stops collecting
connectivity information, the tie breaker applies the pruning
process and comes up with the final group of surviving
processors #_112. Note that the tie breaker can prune itself
out without affecting the efficacy of the pruning methodology.
The tie breaker always has the responsibility of informing
the other processors #_112 of its decision. The pruned
processors #_112 (including the tie breaker) do not halt
until they enter Stage IV.

To get a fully connected graph from the potentially 
partially connected graph of surviving processors, the tie- 
breaker processor #_112 first runs a process that lists all the 
maximal, fully connected subgraphs. It then uses a selection 
process to pick one from the set of alternatives. 

In one embodiment, these processes run in interrupt 
context on the tie-breaker processor #_112 and have low 
upper bounds for execution time and memory requirements. 
The process that lists all the candidate subgraphs requires a
large amount of memory and execution cycles if the number 
of disconnects is large. Therefore, if the number of discon- 
nects is larger than a fixed number (8 in one embodiment), 
then a simpler scheme that picks a fully connected graph that 
is not necessarily optimal is preferred. 
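
One simple scheme of this kind grows a group greedily from a chosen starting processor, adding each live processor that is connected to every member picked so far. The C sketch below illustrates the idea under stated assumptions: the function name pick_group_greedy, the bitmask representation of a group and the constant MAX_PROCS are hypothetical, and a path is counted only if it is up in both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PROCS 16   /* assumed system size for this sketch */

/* Greedy pick: start from seed and add each live processor that is
   connected (in both directions) to every member chosen so far.
   Returns a bitmask in which bit p set means processor p survives. */
static uint16_t pick_group_greedy(bool c[MAX_PROCS][MAX_PROCS], int n, int seed)
{
    assert(n > 0 && n <= MAX_PROCS);
    assert(seed >= 0 && seed < n && c[seed][seed]);

    uint16_t group = (uint16_t)(1u << seed);
    for (int p = 0; p < n; p++) {
        if (p == seed || !c[p][p])
            continue;                       /* skip the seed and dead processors */
        bool ok = true;
        for (int m = 0; m < n && ok; m++)
            if ((group >> m) & 1u)
                ok = c[p][m] && c[m][p];    /* must reach every current member */
        if (ok)
            group |= (uint16_t)(1u << p);
    }
    return group;
}
```

The result depends on the seed and on the examination order, so the group found is fully connected but not necessarily the largest one.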

The method for generating the complete list of maximal, 
fully connected subgraphs in a graph represented by a 
connectivity matrix is described below. 

The input is the NxN connectivity matrix C described 
above. The output is an array of sets of processors that form 
maximal, fully connected subgraphs. 

The methodology uses the following property: When the
edge (i,j) is removed (forming the disconnect (i,j)) from a
fully connected graph that includes vertices i and j, two
maximal, fully connected subgraphs are formed. One subgraph
is the original graph with vertex i (and the edges
connected to it) removed and the other subgraph is the
original graph with vertex j (and its edges) removed.

A partially connected graph can be viewed as a fully 
connected graph to which a set of disconnects has been 
applied. To compute the set of all maximal, fully connected 
subgraphs, a processor #_112 first makes a list of the 
disconnects in the connectivity matrix C. Next, the processor 
#_112 makes an initial solution set that has one member — a 
fully connected graph with all the vertices in the original 
graph. The processor #_112 then successively improves the 
solution set by applying the disconnects one by one. 

The method has the following steps: 

1. Compute the set of all dead processors, that is, the set 
of all processors i such that C(i,i) is FALSE. 

2. Convert the connectivity matrix into canonical form: 
Remove rows and columns corresponding to dead 
processors, and make the matrix symmetric. 

3. Compute the set of all disconnects, the set of pairs (i,j)
such that C(i,i) is TRUE, C(j,j) is TRUE (that is,
processors i and j are alive) and C(i,j) is FALSE. Let D
be the size of the set of disconnects.

4. The variable groups is the solution array and the 
variable numgroups is the number of entries in the 
solution array. Start with an initial solution that con- 
tains one group that is equal to the set of live proces- 
sors. 



groups[0] = live_processors;   /* groups is an array of SETs */
numgroups = 1;                 /* number of elements in the array */



All live processors #_112 are initially assumed to be 
fully connected. Each disconnect is applied in turn, 
breaking the groups in the array into fully connected 
subgroups. 

5. Process each disconnect by applying it to the current 
elements in groups. 

Applying a disconnect (i,j) to a group of processors
#_112 that does not contain processor i or j has no
effect. Applying the disconnect (i,j) to a group that
contains both processors i and j splits the group into
two fully connected subgroups, one the same as the
original with processor i removed and the other the
same as the original with processor j removed.
When a group thus splits into two subgroups, the proces- 
sor #_112 examines each of the new subgroups to see 
whether it already exists or is a subset of an already existing 
group. Only new and maximal subgroups are added to the 
array of groups. 
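
As a self-contained illustration of steps 1 through 5, the C sketch below represents each group as a bitmask instead of relying on a set library. All names (set_t, exists_or_is_subset, apply_disconnect, list_groups, MAX_PROCS, MAX_GROUPS) are hypothetical. The sketch also makes two simplifying assumptions: it builds a fresh solution array for each disconnect rather than editing the array in place, and it applies each disconnect once per unordered pair, counting a path as up only if both directions are up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS  16
#define MAX_GROUPS 256            /* assumed capacity of the solution array */

typedef uint16_t set_t;           /* bit p set: processor p is a member */

/* True if g equals, or is a subset of, one of the first count groups. */
static bool exists_or_is_subset(const set_t *groups, int count, set_t g)
{
    for (int k = 0; k < count; k++)
        if ((g & groups[k]) == g)
            return true;
    return false;
}

/* Step 5: apply the disconnect (i,j) to every group; returns the new count. */
static int apply_disconnect(set_t *groups, int numgroups, int i, int j)
{
    set_t both = (set_t)((1u << i) | (1u << j));
    set_t next[MAX_GROUPS];
    int n = 0;

    /* Groups that do not hold both endpoints are unaffected. */
    for (int k = 0; k < numgroups; k++)
        if ((groups[k] & both) != both)
            next[n++] = groups[k];

    /* Split the others; keep only halves that are new and maximal. */
    for (int k = 0; k < numgroups; k++) {
        if ((groups[k] & both) != both)
            continue;
        set_t halves[2] = { (set_t)(groups[k] & ~(1u << i)),
                            (set_t)(groups[k] & ~(1u << j)) };
        for (int h = 0; h < 2; h++) {
            if (!exists_or_is_subset(next, n, halves[h])) {
                assert(n < MAX_GROUPS);
                next[n++] = halves[h];
            }
        }
    }
    memcpy(groups, next, (size_t)n * sizeof next[0]);
    return n;
}

/* Steps 1 to 5: fills groups with the maximal, fully connected
   subgraphs of the live processors and returns how many there are. */
static int list_groups(bool c[MAX_PROCS][MAX_PROCS], int nprocs, set_t *groups)
{
    assert(nprocs > 0 && nprocs <= MAX_PROCS);

    set_t live = 0;                              /* steps 1-2: drop dead ones */
    for (int p = 0; p < nprocs; p++)
        if (c[p][p])
            live |= (set_t)(1u << p);

    groups[0] = live;                            /* step 4: initial solution */
    int numgroups = 1;

    for (int i = 0; i < nprocs; i++)             /* step 3: find disconnects */
        for (int j = i + 1; j < nprocs; j++)
            if (c[i][i] && c[j][j] && !(c[i][j] && c[j][i]))
                numgroups = apply_disconnect(groups, numgroups, i, j);
    return numgroups;
}
```

For a five-processor system in which the second processor has lost its paths to the third and fourth, the sketch yields two groups: one without the second processor, and one holding the first, second and fifth processors.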

Following is sample C code to perform this methodology.
The sample code assumes a function group_exists_or_is_subset()
to check if a given group is a member of the current
set of groups or is a subset of an existing group. It also
assumes a function library that implements the set type (a
type SET and functions SetMember(), SetCopy(),
SetDelete() and SetSwap()).



for (i = 0; i < D; i++)               /* go through the disconnects */
{
    for (j = 0; j < numgroups; j++)   /* go through the groups
                                         generated so far */
    {
        /* Split group j if it has both vertices of
           disconnect i. */
        if (SetMember(groups[j], disconnects[i][0]) &&
            SetMember(groups[j], disconnects[i][1]))
        {
            /* We need to remove group j and replace it
               with two new groups. This is done by modifying
               group j in place and adding a new group at the
               end of the array. */
            numgroups++;
            /* copy group j to the end of the array */
            SetCopy(groups[j], groups[numgroups - 1]);
            /* remove the first vertex from group j */
            SetDelete(groups[j], disconnects[i][0]);
            /* remove the second vertex from group added at
               the end of the array */
            SetDelete(groups[numgroups - 1], disconnects[i][1]);
            /* Check if the new groups already exist or are
               subgroups of existing groups. */
            /* First check the group added at the end. */
            if (group_exists_or_is_subset(groups,
                    numgroups - 1, groups[numgroups - 1]))
                numgroups--;
            /* Now check the updated group j. First,
               switch it with the last element of the array.
               To remove it, simply decrement the array
               count. */
            /* The j-th entry has been switched; it has to
               be examined again */
            SetSwap(groups[j], groups[numgroups - 1]);
            j--;
            if (group_exists_or_is_subset(groups,
                    numgroups - 1, groups[numgroups - 1]))
                numgroups--;
        }
    }
}

Now, numgroups is the number of maximal, fully connected
subgraphs, and groups contains these subgraphs.

From the set of subgroups thus found, one group survives.
If one treats all processors the same, the best candidate for
survival can be defined as the one with the greatest number
of members. In case of a tie, an arbitrary one can be picked.

In one embodiment, processors have different survival
priorities based on the kinds of services each provides. For
instance, in the NonStop Kernel® and Loosely Coupled
UNIX (LCU) operating system software available from the
assignee of the instant invention, processors that have a
primary or backup $SYSTEM process (a process providing
a system-wide service) have a higher survival priority.

As another example, the lowest-numbered processor can
have the highest survival priority, as explained above.

The execution speed of this node-pruning process
depends on the number of disconnects D and the number of
fully connected groups G. For a given D, the order approximates
D*2^D. Clearly, the worst case order is too large to
attempt for the example sixteen-processor system, but this is
small for very small values of D. In real life, very few
disconnects, if any, are expected.

In a preferred embodiment, when either N (number of live
nodes) or D (number of disconnects between live nodes) is
less than, e.g., 8, the above process for listing groups is used.
This limits the number of groups generated and examined to
256.

However, when the number of disconnects and maximal
fully connected subgraphs is large (e.g., greater than 8),
processes listing all groups become too time consuming to
execute in an interrupt context. Since disconnects result
from rare, multiple failures, picking a sub-optimal group as
the surviving group in the face of a large number of
disconnects is acceptable.

Therefore, when both N and D are greater than, e.g., 8, the
tie breaker will pick one fully connected subgroup randomly
or by other simple means.

In the NonStop Kernel® and LCU preferred embodiments
mentioned above, a $SYSTEM processor is considered a
critical resource, and the tie breaker attempts to select a
group that includes one of the $SYSTEM processors. If the
processor running the primary $SYSTEM process is healthy,
the tie breaker picks a group that includes that processor. If,
however, the processor running the primary $SYSTEM
process has died, but the processor running the backup
$SYSTEM process is alive, then a group that includes the
latter processor is selected.

If both $SYSTEM processors are dead, then the tie
breaker selects a group that includes itself.

The selection described above proceeds as follows:

1. Start with a group that contains a selected processor.
Select the primary $SYSTEM processor if it is healthy.
If the primary $SYSTEM processor is dead, but the
backup $SYSTEM processor is healthy, select the
backup $SYSTEM processor. Otherwise, select the tie
breaker.

2. Examine each live processor. If it is connected to all
members of the current group, add the processor to the
group. (This process gives higher priority to the processors
examined earlier since they need to have connectivity
to fewer processors to be added to the group.)

3. When all processors have been examined, the group is
complete. This group survives this regroup incident.
The tie breaker then enters the next stage (Stage III) of the
regroup operation.

Stage III

When the tie breaker enters Stage III, according to the
node pruning protocol, it additionally sets the Regroup
message pruning result variable to the group selected to
survive. The tie breaker then informs all other processors
#_112 that it has entered Stage III by sending them the value
of its pruning result variable.

In Stage III, each processor #_112 informs all processors
(including the pruned out ones) that it is in Stage III and
relays the tie breaker's pruning decision. If a processor
#_112 finds itself pruned out, it does not halt until it enters
Stage IV. To guarantee that all processors #_112 get to know
the pruning decision, the pruned out processors
#_112 participate in relaying the pruning decision.

Stage IV

A processor #_112 in Stage III enters Stage IV when it
determines that all of the processors #_112 known to be
available in Stage II have entered Stage III. This means that
all processors #_112 in the connected group have been
informed of the pruning decision. The processor #_112 can
now commit to the new surviving group. A processor #_112
that finds itself pruned out stays in Stage III until it hears that
a processor #_112 that was not pruned out has entered Stage
IV. The pruned out processor #_112 then halts, since that
survivor processor #_112 in Stage IV can ensure that all
other survivors will enter Stage IV. (The tie-breaker processor
#_112 that executed the node pruning can now halt if it
was not among the survivors. The tie breaker's role in the
current regroup operation is complete.)

As a surviving processor enters Stage IV, it sets its
OUTER_SCREEN and INNER_SCREEN #_730 and
#_740 to reflect the pruning result, selects the lowest-numbered
surviving processor #_112 as indicated by the
pruning result variable as the tie breaker for use in the next
regroup operation, and cleans up any messages from and to
the processors #_112 that did not survive.

If a regroup operation restarts at Stage III, a processor
#_112 checks the pruning result variable. If the processor
#_112 finds itself pruned out, it halts. This guarantees that
if any processor #_112 has committed to the new surviving
group and entered Stage IV, the pruned out processors
#_112 do not survive the restart of the regroup operation.

If connectivity is very poor, a pruned out processor (say,
processor #_112b) can stall in Stage III. This can happen,
for instance, if all processors #_112 with which processor
#_112b can communicate have also been pruned out and
halt before processor #_112b can enter Stage IV. When the
processor #_112b detects that it is not making progress in
Stage III (after some number of clock ticks have passed), the
regroup operation restarts. As described above, this restart
will cause the processor #_112b to quickly kill itself.

A system with pruned out processors #_112 that have
been isolated could briefly experience a split-brain situation
as the surviving processors #_112 quickly complete regroup
and declare the pruned out processors #_112 dead while the
pruned out processors #_112 are stalling in Stage III. This,
however, does not cause data corruption since these processors
#_112 suspend all I/O traffic while in stages I through
III of a regroup operation.

The pre-existing Stage III as described above constitutes
the remainder of this Stage IV of the regroup operation of the
invention.
Stages V and VI

The pre-existing stages IV and V are renumbered V and 
VI for the regroup operation of the invention. 

Maintenance of Mask of Unreachable Processors 

If a processor #_112 detects that no packets are getting
through on any of the redundant paths to another processor
#_112, it sets to logical TRUE the bit in the mask of
unreachable processors corresponding to that other processor
#_112. A new regroup incident, however, does not start.
Because regroup incidents suspend general I/O, a multiprocessor
system should spend minimal time doing such reconfiguring.
A regroup incident will start soon enough on the
detection of missing IamAlives due to the link failure.

The mask of unreachable processors is used in Stage II as
described above. The mask is maintained until Stage III.

When regroup is in Stage III, any node pruning has
already happened and the new group has self-pruned accordingly.
The mask is examined. If the new group contains both
the local processor #_112 and the unreachable processor
#_112, then the regroup operation restarts.
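
The mask handling can be sketched in C as follows. This is an illustrative sketch, not the implementation of the invention: the type procmask_t and the functions mark_unreachable and must_restart_stage3 are hypothetical names, and the sketch assumes at most sixteen processors so that one bit per processor fits in a 16-bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t procmask_t;   /* bit p set means processor p */

/* Record that no packets get through to peer on any redundant path.
   No regroup incident is started here. */
static procmask_t mark_unreachable(procmask_t unreachable, int peer)
{
    assert(peer >= 0 && peer < 16);
    return (procmask_t)(unreachable | (1u << peer));
}

/* Stage III check: restart if the surviving group still contains both
   the local processor and a processor it cannot reach. */
static bool must_restart_stage3(procmask_t new_group, int self,
                                procmask_t unreachable)
{
    assert(self >= 0 && self < 16);
    bool self_in_group = ((new_group >> self) & 1u) != 0;
    return self_in_group && (new_group & unreachable) != 0;
}
```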

This seemingly complicated scheme is preferable to
restarting regroup each time a link failure is detected as the
former prevents a regroup operation from restarting many
times due to multiple link failures that are detected due to the
sending of regroup packets but which actually occurred
before the regroup incident started. In a preferred
embodiment, in order to detect regroup software bugs as
well as severe connectivity problems that get worse as
regroup proceeds, the processor #_112 halts if the regroup
operation restarts more than 3 times without completing
once.

If a link comes up after a regroup operation has started, its
effect on the procedure depends on how far the procedure
has progressed. If the link comes up in time to make the tie
breaker consider the link operational, the link "survives"
(that is, one of the processors #_112 connected by the link
escapes certain death). Regroup packets have to go in both
directions, and this fact has to be conveyed to the tie breaker
before the tie breaker considers the link good. If the link
status change happens too late in the regroup incident for the
tie breaker to detect it, the link is considered down and at
least one of the processors #_112 connected by the link is
killed. This exclusion is acceptable. Therefore, a link coming
up event is not reported to regroup, unlike a link failure
event.

Restarts 

To make progress through the stages of a regroup
operation, a processor #_112 needs to hear from the processors
#_112 from which it has previously heard. If a
processor #_112 or communication link fails after a regroup
operation starts, the processor #_112 can stall in any of the
stages after Stage I. Therefore, a timer (not shown) detects
the lack of progress. The processor #_112 starts the timer
when it enters Stage II of the regroup operation and clears
the timer on entering Stage VI when the regroup operation
stabilizes. If the timer expires before the algorithm ends, the
processor #_112 restarts the regroup operation (i.e.,
re-enters Stage I).
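
The progress timer, together with the limit of three restarts mentioned earlier for a preferred embodiment, can be sketched in C as a small state record. The structure regroup, the functions enter_stage and on_tick, and the stall_budget parameter (the number of regroup ticks allowed before a stall is declared) are hypothetical names chosen for this sketch; the text does not specify the timer's duration.

```c
#include <assert.h>
#include <stdbool.h>

enum { STAGE_I = 1, STAGE_II, STAGE_III, STAGE_IV, STAGE_V, STAGE_VI };
enum { MAX_RESTARTS = 3 };   /* halt on "more than 3" restarts */

struct regroup {
    int  stage;
    bool timer_armed;
    int  ticks_left;   /* hypothetical stall budget, in regroup ticks */
    int  restarts;     /* restarts since the last completed regroup */
    bool halted;
};

/* Arm the timer on entering Stage II; clear it on entering Stage VI. */
static void enter_stage(struct regroup *r, int stage, int stall_budget)
{
    assert(stage >= STAGE_I && stage <= STAGE_VI);
    r->stage = stage;
    if (stage == STAGE_II) {
        r->timer_armed = true;
        r->ticks_left = stall_budget;
    }
    if (stage == STAGE_VI) {
        r->timer_armed = false;
        r->restarts = 0;          /* the regroup operation completed */
    }
}

/* Called on each regroup tick; restarts or halts when the timer expires. */
static void on_tick(struct regroup *r)
{
    if (r->halted || !r->timer_armed)
        return;
    if (--r->ticks_left > 0)
        return;
    r->timer_armed = false;       /* expired: no progress was made */
    if (++r->restarts > MAX_RESTARTS)
        r->halted = true;
    else
        r->stage = STAGE_I;       /* restart the regroup operation */
}
```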

After a processor #_112 commits to a new group and
declares another processor #_112 dead, the banished processor
#_112 is not allowed to come back in when the
regroup operation restarts. A processor #_112 commits to a
new group when it enters Stage IV. It does so only after all
processors #_112 in the connected graph of processors
known at Stage II have entered Stage III and have set the
pruning result variable to the commit group. If the regroup
operation restarts now, all pruned out processors #_112 kill
themselves since the pruning result variable indicates that
they have been excluded. Processors #_112 that were not in
the connected graph (at Stage II) cannot join the group since
they are not among the processors #_112 known at Stage II.

Message clean up actions must be completed correctly, 
regardless of how many times the algorithm goes through 
restarts. 

Regroup and Detection of Timer Failures 

Independently of or in conjunction with the split-brain 
avoidance and/or the node-pruning protocols, a multipro- 
cessor system can detect the loss of timer expirations as 
follows: A processor #_112 running the regroup algorithm 
does not advance through Stage I until the processor #_112 
receives a timer tick. If a processor has corrupted operating 
system data structures (e.g., a time list), the regroup engine 
will not receive its periodic ticks and will not advance 
further than Stage I. Since the malatose processor #_112
does not indicate that it has entered Stage I, the other 
processors will declare it down. The faulty processor halts 
on receipt of a Stage II Regroup message or a poison packet 
indicating that it has been eliminated. 

In the split-brain avoidance and node-pruning scenarios, 
the connectivity matrix preferably subsumes the KNOWN_STAGE_n
variables #_750. In these embodiments, a processor
#_112 does not update its connectivity matrix C until
it receives a timer tick. 
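
The gating of Stage I on a timer tick can be sketched in C as follows. The structure stage1 and the three functions are hypothetical names for this sketch. The key assumption is that stage1_on_timer_tick is reached only through the timer expiration mechanism, so a processor whose timer mechanism has failed never sets the flag and never asserts its own health.

```c
#include <assert.h>
#include <stdbool.h>

struct stage1 {
    bool tick_seen;   /* set only from the timer expiration path */
    bool announced;   /* true once Stage I status has been asserted */
};

/* Called when a regroup operation starts: nothing is asserted yet. */
static void stage1_begin(struct stage1 *s)
{
    assert(s != 0);
    s->tick_seen = false;
    s->announced = false;
}

/* Called by the timer expiration event. A processor with a corrupted
   time list never gets here. */
static void stage1_on_timer_tick(struct stage1 *s)
{
    s->tick_seen = true;
}

/* Called before the processor lists itself in a Regroup message or
   updates its own status. Returns true only after a tick was seen. */
static bool stage1_may_assert_health(struct stage1 *s)
{
    if (s->tick_seen)
        s->announced = true;
    return s->announced;
}
```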

Scenarios Revisited 

The application of the invention to the above five- 
processor and two-processor scenarios is described below. 

FIG. #_2 is a graph #_200 logically representing a
five-processor multi-processor system #_200. The graph
#_200 of FIG. #_2 is fully connected. When communication
faults occur dividing the system #_200 into the graph
#_400 of FIG. #_4, each processor #_112 applies the
split-brain avoidance methodology described above. The 
processor 2, for example, may notice its failure to receive an 
IamAlive message from processor 3, for example. The 
processor 2 accordingly initiates a regroup operation. In 
Stage I of that Regroup operation, the processor 2 starts its 
internal timer, resets its connectivity matrix C and suspends 
I/O activity. The processor 2 then sends a Regroup message 
and receives and compares Regroup messages, updating its 
connectivity matrix C accordingly. The processor 2 receives 
Regroup messages from processors 1 and 5, and these 
Regroup messages indicate the existence of processors 3 and 
4. When the appropriate time limit has been reached, the 
processor 2 proceeds to Stage II. 

In Stage II, the processor 2 selects the processor 1 as the
tie-breaker processor #_112 since the processor 1 was the
lowest numbered processor #_112 at the end of the last
regroup operation to complete.

The processor 2 then applies the split-brain avoidance 
methodology: The processor 2 recognizes that the group of 
processors #_112 of which it is a part has more than 
one-half of the processors that were present before this 
regroup operation started. Accordingly, the processor 2 
continues operations. 

Indeed, the group has all five of the processors 1-5 in the 
system #_400, and all five of the processors 1-5 will 
continue operations at this point. All five of the processors 
1-5 select processor 1 as the tie breaker. 
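
The survive-or-halt decision of the split-brain avoidance methodology can be sketched in C as a single function. The names verdict, tb_state and split_brain_verdict are hypothetical. Three branches are assumptions of this sketch rather than statements taken from the passages above: that a group with fewer than one-half of the previous processors halts, that a processor whose group contains the tie breaker is treated like the tie breaker itself, and that a tie breaker positively found to be down lets the processor survive.

```c
#include <assert.h>
#include <stdbool.h>

enum verdict  { SURVIVE, HALT };
enum tb_state { TB_HEALTHY, TB_DOWN, TB_UNKNOWN };

/* group_size:     processors in this processor's group
   previous_size:  processors present before the regroup started
   group_has_tb:   true if the tie breaker is in this group
   tb:             what could be learned about the tie breaker
   tb_unreachable: tie breaker's bit in the mask of unreachable processors */
static enum verdict split_brain_verdict(int group_size, int previous_size,
                                        bool group_has_tb, enum tb_state tb,
                                        bool tb_unreachable)
{
    assert(group_size >= 1 && group_size <= previous_size);

    if (2 * group_size > previous_size)
        return SURVIVE;                 /* more than one-half */
    if (2 * group_size < previous_size)
        return HALT;                    /* assumption: minority halts */

    /* Exactly one-half: the tie breaker decides. */
    if (group_has_tb)
        return SURVIVE;
    if (tb == TB_HEALTHY)
        return HALT;
    if (tb == TB_DOWN)
        return SURVIVE;                 /* assumption of this sketch */

    /* State unknown: fall back on the mask of unreachable processors. */
    return tb_unreachable ? HALT : SURVIVE;
}
```

In the five-processor scenario every processor sees a group of five out of five and takes the first branch.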

The tie-breaker processor 1 waits in Stage II until either 
a reasonable amount of time to send Regroup messages has 
passed or until its connectivity matrix C indicates that all 
paths are up. Here, by assumption, all paths are not up, and 
the tie-breaker processor 1 waits in Stage II the reasonable
amount of time. It then applies the node-pruning methodology
to determine the final group of processors #_112 to
survive the regroup operation. It then distributes this decision
in a Stage III Regroup message with the node-pruning
result variable set to reflect the decision. The processors 2-5
wait in Stage II until they receive this Regroup message with
its pruning result variable set.

Using its memory-resident connectivity matrix C as input,
the tie breaker computes the set of all dead processors. This
set is the null set, and a conversion of the matrix C to
canonical form leaves this matrix C unchanged. The tie
breaker computes the set of disconnects as {(2, 3), (2, 4), (3,
2), (4, 2)}, with D=4, and applies these disconnects to the set
of live processors {1, 2, 3, 4, 5}. The resulting groups of
processors #_112 are {1, 3, 4, 5} and {1, 2, 5}. Thus, the
number of maximal, fully connected subgraphs is two.

Depending on the criteria for survival, either of the two
groups may survive. If the criterion is the largest group, then
the tie breaker selects the group {1, 3, 4, 5} for survival. If
the criterion is the group with the lowest-numbered
processor, then either group can survive (with the former
criterion used as a tie breaker or with one group chosen
randomly, for example). If the processor 2 is running a
high-priority process, the tie breaker may choose the group
{1, 2, 5} for survival. These are merely a few examples of
the criteria disclosed in the related patent applications enumerated
above or well-known within the art. Assume that the
group {1, 3, 4, 5} survives.

The tie-breaker processor communicates this decision by
setting the node-pruning variable in the next Regroup message
that it sends out. The sending of the message indicates
that the tie breaker is in Stage III, and the receipt of that
message (directly or indirectly) causes the other processors
2-5 to enter into Stage III also. The pruning result variables
of all processors 2-5 in Stage III hold the same value
indicating that the processors 1, 3, 4 and 5 are to continue
operations and that the processor 2 is to halt operations.
Each of the processors 1-5 relays this pruning result in the
Regroup messages that it respectively originates.

When each of the processors 1-5 gathers Regroup messages
indicating that all of the processors #_112 known to
it in Stage II have entered Stage III, then the processor enters
Stage IV and commits to the pruning result. At this stage,
processor 2 halts operations. The regroup operation continues
to completion. The maximal, fully connected group of
processors 1, 3, 4 and 5 continues operation as the newly
reconfigured system.

Likewise, FIG. #_3 is a graph #_300 logically representing
a two-processor multi-processor system #_300. The
graph #_300 of FIG. #_3 is fully connected. When communication
faults occur dividing the system #_300 into the
graph #_500 of FIG. #_5, each processor #_112 marks the
other as unreachable in the mask of unreachable processors and
applies the split-brain avoidance methodology described
above. The processor 1, for example, may notice its failure
to receive an IamAlive message from processor 2. The
processor 1 accordingly initiates a regroup operation. In
Stage I of that Regroup operation, the processor 1 starts its
internal timer, resets its connectivity matrix C and suspends
I/O activity. The processor 1 then sends a Regroup message
and prepares to receive and compare Regroup messages in
order to update its connectivity matrix C. In this scenario,
however, the processor 1 receives no such Regroup messages.
When the appropriate time limit has been reached
(and if the processor 1 of itself constitutes enough resources
to continue operations, if appropriate), the processor 1
proceeds to Stage II.

In Stage II, the processor 1 selects itself as the tie-breaker
processor #_112 since it was the lowest numbered processor
#_112 at the end of the last regroup operation to complete.

The processor 1 then applies the split-brain avoidance
methodology: The processor 1 recognizes that the group of
processors #_112 of which it is a part has neither more nor
less than one-half of the processors #_112 that were present
before the regroup operation began. Its group has exactly
one-half of the pre-existing processors #_112, and the
processor 1 uses the fact that it is itself the tie-breaker
processor #_112 as the decision point to continue operations.

Not being the tie breaker, the processor 2 attempts to
check the state of the tie-breaker processor 1 (in one
embodiment, using the service processors). If the state of the
tie breaker can be determined, the processor 2 realizes that
the tie breaker is healthy. The processor 2 halts.

Where the state of the tie-breaker processor 1 cannot be
determined, the processor 2 checks the mask of unreachable
processors. Noting that the tie breaker is marked
unreachable, the processor 2 assumes that the tie breaker is
healthy and halts.

The tie-breaker processor 1 continues operation
while the processor 2 halts.

The processor 1 selects itself as the tie-breaker processor
#_112 and remains in Stage II until a reasonable amount of
time passes. (The processor 2 cannot and indeed does not
send Regroup messages as the communication fault has
occurred and the processor has halted.)

The processor 1 applies the pruning process and determines
the group of processors #_112 that are to survive the
regroup operation. Using its memory-resident connectivity
matrix C as input, the tie breaker computes the set of all dead
processors, {2}, and converts its matrix C into canonical
form. This conversion leaves a 1x1 matrix C including only
the processor 1. The tie breaker computes the set of disconnects
as the set {(1, 2), (2, 1)}, with D=2. However, as the
set of live processors {1} does not include the processor 2,
applying these disconnects to that set has no effect. The
number of maximal, fully connected graphs is one, and the
tie breaker sets its pruning result variable to indicate that
only it will survive. The tie breaker communicates this result
in its subsequent Regroup messages and thus passes through
Stages III and IV. The system #_500 completes the regroup
operation and continues operations with only the processor
1 running.

Finally, consider again the five-processor multi-processor
system #_200. Now, the processor 2 experiences a corruption
of its time list, fails to receive timer expiration interrupts and
loses its ability to send the requisite IamAlive messages. The
detection of the missing IamAlive messages by any of the
other processors 1 or 3-5 causes a regroup operation to
begin.

In Stage I of the regroup operation as related above, the
processors 1-5, operating according to one embodiment of
the invention, each refrain from sending respective Stage I
Regroup messages until each receives a timer expiration
interrupt. Thus, the processors 1 and 3-5 readily proceed to
send Stage I Regroup messages.

By hypothesis, the processor 2 does not receive timer
interrupts and thus never sends a Stage I Regroup message.
The other processors 1 and 3-5 update their respective
KNOWN_STAGE_1 variables #_750a (and/or their
respective connectivity matrices C) to reflect the healthiness
of the processors 1 and 3-5 and the apparent death of the 
processor 2. After some predetermined amount of time has 
passed waiting for the processor 2, the processors 1 and 3-5 
proceed to Stage II. 

In Stage II, the processors 1 and 3-5 now broadcast Stage 
II Regroup messages. The processors 1 and 3-5 are healthy 
and the processor 2 is still malatose, and the Stage II 
Regroup messages eventually reflect this condition. The 
KNOWN_STAGE_2 variable #_750b becomes equal to
the KNOWN_STAGE_1 variable #_750a.

The processor 2, by hypothesis, still receives the Regroup 
messages from the processors 1 and 3-5. It eventually 
receives a Stage II Regroup message wherein the 
KNOWN_STAGE_1 and _2 variables #_750a, #_750b
are equal and exclude the processor 2. The processor 2 
notices this type of Stage II Regroup message and halts. 
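
The halting test applied by the faulty processor can be sketched in C. The function name should_halt_on_stage2 and the bitmask representation of the KNOWN_STAGE_1 and KNOWN_STAGE_2 variables (one bit per processor, at most sixteen processors) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a received Stage II Regroup message shows that the
   other processors have settled on a group that excludes this one:
   the two KNOWN_STAGE masks are equal and the bit for self is clear. */
static bool should_halt_on_stage2(int self, uint16_t known_stage_1,
                                  uint16_t known_stage_2)
{
    assert(self >= 0 && self < 16);
    bool settled  = known_stage_1 == known_stage_2;
    bool excluded = ((known_stage_2 >> self) & 1u) == 0;
    return settled && excluded;
}
```

With processors 1 through 5 mapped to bits 0 through 4, the scenario above corresponds to both masks holding 0x1D and self being bit 1.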

Processors 1 and 3-5 proceed through the remainder of 
the regroup operation and form the system #_200'. Now,
instead of the IamAlives missing from the processor 2
periodically perturbing the system #_200, the system
#_200' excludes the processor 2 altogether. (Also, the
processor 2 is dead and therefore harmless.) 

Of course, the program text for such software incorpo- 
rating the invention herein disclosed can exist in its static 
form on a magnetic, optical or other disk; in ROM, in RAM 
or in another integrated circuit; on magnetic tape; or in 
another data storage medium. That data storage medium 
may be integral to or insertable into a computer system. 

What is claimed is: 

1. In a multi-processor system having a plurality of 
processors each having a respective memory, a method for 
tolerating timer expiration failure in one of said plurality of 
processors, said method comprising: 

subjecting each of said plurality of processors to a method 
including respective advancement from a first to a 
second stage, initially placing said each processor in 
said first stage; 

sending status of advancement of a second of said plu- 
rality of processors; 

receiving on said one processor said status of advance- 
ment of said second processor; 

after said receiving, updating status of said one processor 
only if notification of a time expiration has occurred on 
said one processor; 

respectively advancing to said second stage each proces- 
sor which has updated its status; and 

determining that timer expirations have failed on said one 
processor when said one processor fails to advance 
from said first stage. 
2. A computer system comprising:
a communications network; 

a plurality of processors, communicatively connected by 
means of said communications network, each of said 
plurality of processors having a respective memory 
wherein is located a computer program for causing said 
computer system to tolerate timer expiration failure in 
one of said plurality of processors by 
subjecting each of said plurality of processors to a 
method including respective advancement from a 
first to a second stage, initially placing said each 
processor in said first stage; 
sending status of advancement of a second of said 

plurality of processors; 
receiving on said one processor said status of advance- 
ment of said second processor; 
after said receiving, updating status of said one pro- 
cessor only if notification of a time expiration has 
occurred on said one processor; 
respectively advancing to said second stage each pro- 
cessor which has updated its status; and 
determining that timer expirations have failed on said 
one processor when said one processor fails to 
advance from said first stage. 

3. An article of manufacture comprising a medium for 
data storage wherein is located a computer program for 
causing a multiprocessor system having a plurality of 
processors, each having a respective memory, to tolerate 
timer expiration failure in one of said plurality of processors 
by 

subjecting each of said plurality of processors to a method 
including respective advancement from a first to a 
second stage, initially placing said each processor in 
said first stage; 

sending status of advancement of a second of said plu- 
rality of processors; 

receiving on said one processor said status of advance- 
ment of said second processor; 

after said receiving, updating status of said one processor 
only if notification of a time expiration has occurred on 
said one processor; 

respectively advancing to said second stage each proces- 
sor which has updated its status; and 

determining that timer expirations have failed on said one 
processor when said one processor fails to advance 
from said first stage. 